The increase of bodyplan complexity 
in early bilaterian evolution correlates 
with the advent and diversification 
of microRNAs. These small RNAs guide 
animal development by regulating tem- 
poral transitions in gene expression 
involved in cell fate choices and transi- 
tions between pluripotency and differenti- 
ation. One of the two known microRNAs 
whose origins date back before the bilate- 
rian ancestor is mir-100. In Bilateria, it 
appears stably associated in polycistronic 
transcripts with let-7 and mir-125, two 
key regulators of development. In verte- 
brates, these three microRNA families 
have expanded to form a complex system 
of developmental regulators. In this con- 
tribution, we disentangle the evolutionary 
history of the let-7 locus, which was 
restructured independently in nematodes, 
platyhelminths, and deuterostomes. The 
foundation of a second let-7 locus in 
the common ancestor of vertebrates and 
urochordates predates the vertebrate- 
specific genome duplications, which then 
caused a rapid expansion of the let-7 
family. 



Introduction 

As a class, microRNAs exhibit several 
unusual features in their evolutionary 
history. Despite the short length of the 
functional mature microRNAs (miRs) 
of only ~22 nt, they are extremely well 
conserved and hence detectable with high 
accuracy in genomic sequence data. 1 Since 



the class of microRNAs is subdivided into 
hundreds of families with presumably 
independent origins, the presence/absence 
of at least one representative of a family 
forms a set of valuable and phylogeneti- 
cally highly informative characters. 2,3 
While the loss of entire families is rare 
overall, some clades, such as in the tuni- 
cate Oikopleura dioica, 4 have undergone a 
major restructuring of their microRNA 
repertoire. On the other hand, individual 
microRNA families, with a few exceptions, 
evolve much like other multi-gene families, 
showing gains by duplication and lineage- 
specific losses of paralogs, and reflecting the 
genome-wide duplication events. 3,5 

Most microRNA families have only a 
few paralogous members making it fairly 
straightforward to resolve their evolution- 
ary histories in full detail. The most 
complex example studied exhaustively to 
date is the mir-17 cluster, comprising 
about 15 microRNAs with eight different 
miRBase numbers that can belong to 
three unrelated families. 6,7 Apart from 
repeat-derived microRNAs and the huge 
imprinted mammalian microRNA clusters 
that behave in a repeat-like fashion, there 
is only a single ancient class of microRNAs 
whose evolutionary history has remained 
poorly understood: the let-7 family. This 
may come as a surprise, given that Caeno- 
rhabditis elegans let-7 and lin-4 were the 
first microRNAs to be discovered 8 more 
than a decade before microRNAs were 
recognized as a general class of RNA 
regulators, 9 and despite the fact that its 
phylogenetic distribution was the 
subject of systematic investigation already 
a decade ago. 10,11 

Both let-7 and lin-4 direct temporal 
development in different larval stages of 
C. elegans. 12 While lin-4 miRNA does 
not seem to exist outside of Rhabditida 
(parasitic roundworms), the let-7 miRNA 
is conserved throughout Bilateria. Outside 
of Rhabditida, it is clustered with mir-100 
miRNA; in Panarthropoda and Deutero- 
stomia this cluster is further extended by 
a mir-125 as third microRNA, whose 
sequence is unrelated to both let-7 and 
mir-100. Vertebrate genomes, on the 
other hand, typically contain a dozen or 
more let-7 paralogs, some in clusters with 
paralogous copies of mir-125 and/or 
mir-100 and some located in isolation, 13 
see also the data provided in miRBase 
Release 18.0. 14 Only part of this diversity 
can be accounted for by the genome-wide 
duplication events at the origin of verte- 
brates. 15 The major difficulty in deriving a 
comprehensive picture of the evolution of 
the let-7 family is the correct assignment 
of orthology; unfortunately, the naming 
conventions used by miRBase are not 
helpful and even misleading in some cases. 
In addition, the annotated set of animal 
let-7 sequences is still rather incomplete. 

In this contribution, we therefore char- 
acterize the evolution of the let-7 family 
and its associated microRNAs based on 
a comprehensive homology search and a 
careful analysis of their orthology rela- 
tionships by means of both sequence 
comparison and assessment of synteny. 
We further suggest a refined nomenclature 
for the members of the let-7 family 
that better reflects their evolutionary 
relationships. 

Materials and Methods 

Starting point of our analysis was the 
collection of all lin-4, let-7, mir-100, and 
mir-125 precursor sequences compiled in 
the miRBase database release 16. This 
includes 250 let-7, 86 mir-125, and 85 
mir-100 sequences in deuterostomes as 
well as 33 let-7, 26 mir-125, 27 mir-100, 
and 6 lin-4 microRNAs in protostomes. 

We performed a comprehensive homo- 
logy search in the genomic sequences 
available for 60 deuterostomes, 50 proto- 
stomes, and five cnidarians. To this end, we 
extended all miRBase-derived microRNA- 
precursor sequences to a uniform length. 
Subsequently, BLAST 16 was applied with 
these sequences to search the genomes 
of 60 deuterostomes in order to collect a 
comprehensive data set. In cases where 
BLAST was not able to detect a certain 
homolog, we used the semi-global se- 
quence alignment tool GotohScan 17 and 
finally Infernal. 18 
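
The extension of precursor sequences to a uniform length can be illustrated with a short sketch. The target length of 120 nt, the function name, and the coordinate convention (0-based, half-open) are assumptions made for illustration only; the text does not state which length or padding scheme was actually used.

```python
def extend_to_length(start, end, target_len, seq_len):
    """Pad the interval [start, end) roughly symmetrically to target_len.

    Coordinates are 0-based and half-open. The interval is shifted
    inward if the padding would run past either end of the sequence.
    """
    current = end - start
    if current >= target_len:
        return start, end  # already long enough, leave unchanged
    left_pad = (target_len - current) // 2
    new_start = max(0, start - left_pad)
    new_end = min(seq_len, new_start + target_len)
    # If the right end was clipped, recover the missing length on the left.
    new_start = max(0, new_end - target_len)
    return new_start, new_end


# Hypothetical precursor at positions 100..180 on a 1000-nt scaffold.
print(extend_to_length(100, 180, 120, 1000))
```

The padded intervals would then be cut from the genome and used as BLAST queries, so that every query carries a comparable amount of flanking sequence.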

In order to firmly establish orthology 
relations, we determined for each micro- 
RNA gene its genomic context: for 
intergenic microRNAs, we recorded both 
adjacent protein-coding genes, for intronic 
locations, we recorded the surrounding 
genes. Homology of these protein-coding 
genes was established by alignments of 
the amino acid sequences whenever the 
corresponding information could not be 
retrieved from a database. Synteny among 
teleost species was determined from the 
pairwise alignment nets 19 provided through 
the UCSC genome browser. Regions 
with a size on the order of 100 kb were 
visually inspected for this purpose. 
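
The flanking-gene comparison described above can be sketched as a simple set intersection. This is a minimal illustration, not the procedure actually used: the function name is hypothetical, and only SMG9 and HAS1 are gene names taken from the text (they are reported to flank the A locus), whereas GENE_X and GENE_Y are placeholders.

```python
def shared_flanking_genes(flank_a, flank_b):
    """Return the sorted gene names found in both flanking-gene lists."""
    return sorted(set(flank_a) & set(flank_b))


# Flanking genes of two candidate orthologous loci (partly hypothetical).
human_a_locus = ["SMG9", "HAS1", "GENE_X"]
teleost_aa_locus = ["HAS1", "SMG9", "GENE_Y"]
print(shared_flanking_genes(human_a_locus, teleost_aa_locus))
```

A non-empty intersection supports, but does not by itself prove, orthology of the two loci; in the analysis it is combined with sequence similarity.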

Taking into account the genomic loca- 
tions, synteny information, and conser- 
vation of the 5p-miR/3p-miR regions, we 
built alignments for each microRNA 
family and its subfamilies. MicroRNA- 
like hairpin structures and secondary 
structure conservation were checked 
using RNAfold and RNAalifold from the 
Vienna RNA package. 20,21 

Analyses of deuterostome data were 
mainly based on manually curated align- 
ments, calculated by ClustalW. 22 For 
phylogenetic analyses, we refined these 
alignments to a selection of taxa contain- 
ing one member of primates (human), 
Laurasiatheria (dog), Metatheria (either 
wallaby or opossum), Prototheria (platypus), 
Lepidosauria (anoles), Aves (either chicken 
or turkey), Amphibia (frog), and all avail- 
able teleost sequences. Different members 
of the same cluster were aligned indepen- 
dently and then concatenated to increase 
the signal-to-noise ratio. SplitsTree 23 was 
applied for a visual investigation and refine- 
ment of our microRNA family assignment. 

We estimated the phylogeny of the 
mixed clusters A, C, and D and the homo- 
geneous clusters E-J using the program 
MrBayes v3.1. 24 Therein, the mir-100 
family of cluster C served as outgroup. 



We used jModelTest 25 to select the best 
fitting nucleotide substitution model using 
the AIC. The JC and K80+I+G models were 
selected for the mixed and homogeneous 
clusters, respectively. MrBayes was used to 
infer the posterior majority rule consensus 
tree along with posterior support for all 
internal branches using the respective 
evolutionary model and mainly default 
settings. The Bayesian phylogenetic ana- 
lyses of mixed and homogeneous clusters 
were run twice in parallel with eight (seven 
heated) Metropolis-coupled Markov chain 
Monte Carlo chains for 10 000 000 and 
2 000 000 generations sampling every 
1000th and 100th iteration, respectively. 
The initial 2 500 000 and 500 000 gen- 
erations were discarded as burn-in during 
the estimation of the consensus trees of 
clusters A, C, and D and E-J, respectively. 
The trees are illustrated using 
Dendroscope. 26 All branches are labeled 
with their respective posterior support. 
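
The chain lengths, sampling intervals, and burn-in values given above determine how many sampled trees enter each consensus tree. The following sketch works through that arithmetic per run; it ignores the sample of the initial state that some programs also record, and the function name is an assumption for illustration.

```python
def retained_samples(generations, sample_every, burnin_generations):
    """Number of samples per run that remain after discarding the burn-in."""
    total = generations // sample_every
    discarded = burnin_generations // sample_every
    return total - discarded


# Mixed clusters A, C, and D: 10,000,000 generations, every 1000th sampled.
print(retained_samples(10_000_000, 1000, 2_500_000))
# Homogeneous clusters E-J: 2,000,000 generations, every 100th sampled.
print(retained_samples(2_000_000, 100, 500_000))
```

In both settings the burn-in corresponds to the first 25% of the chain.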

The Infernal package 18 was used for 
structure- and sequence-based homology 
searches whenever sequence conservation 
alone seemed to be insufficient. Therein, 
the program cmbuild was utilized with 
standard parameters to derive a CM for 
every let-7, mir-100, and mir-125 (sub) 
family, given the corresponding cleaned 
alignment and its consensus secondary 
structure. Here, cleaned alignments 
contain only full-length microRNA 
sequences composed exclusively of the 
nucleotides A, C, G, and T. 
In addition, we created a compound 
mir-100 CM based on a manually curated 
alignment of mir-100 sequences derived 
from clusters A, C, and D for searches in 
Protostomia and basal Metazoa. Besides 
homology searches, these family models 
were used to determine the structural and 
sequence similarity of the let-7 families 
in contrast to the phylogenetic analysis 
that was done with MrBayes. For each 
let-7 family, we used all its microRNA 
sequences to calculate the average bitscore 
against each family CM. 
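
The averaging step can be sketched as follows, assuming the bitscores have already been parsed from the search output into (model family, sequence family, bitscore) triples. The input format, the function name, and the numbers in the example are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_bitscores(hits):
    """Average the bitscores for every (model family, sequence family) pair.

    hits: iterable of (model_family, sequence_family, bitscore) triples.
    Returns a dict mapping (model_family, sequence_family) to the mean score.
    """
    grouped = defaultdict(list)
    for model_family, sequence_family, bitscore in hits:
        grouped[(model_family, sequence_family)].append(bitscore)
    return {pair: mean(scores) for pair, scores in grouped.items()}


# Toy input: two sequences of family E-let-7-1 scored against two models.
toy_hits = [
    ("E-let-7-1", "E-let-7-1", 50.0),
    ("E-let-7-1", "E-let-7-1", 54.0),
    ("F-let-7-1", "E-let-7-1", 40.0),
    ("F-let-7-1", "E-let-7-1", 44.0),
]
print(average_bitscores(toy_hits))
```

The resulting dictionary corresponds to one cell per model/family pair in a similarity matrix.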

Results 

Our survey resulted in a single copy of 
let-7 in Protostomia, 14 let-7 genes in 
human, and 19 let-7 copies in teleosts, 
except for the zebrafish (Danio rerio) 
where 21 genes could be retrieved. We 
note that mir-99 has long been known as 
a homolog of mir-100, while mir-98 is a 
let-7 homolog, see e.g. Roush and Slack. 13 
In the following, we discuss the evolution- 
ary history of the let-7 system in detail. As 
a resource, we provide extensive supple- 
mental information in machine readable 
form, including covariance models, struc- 
ture-annotated multiple sequence align- 
ments, and the genomic coordinates of all 
microRNAs discussed in this work 
(http://www.bioinf.uni-leipzig.de/publications/supplements/11-022). 

Let-7 microRNAs in basal Metazoans. 
It is well known that miRNA mir-100 is 
one of the oldest miRNAs in animal 
species. The most ancient organism that 
has a mir-100 copy encoded in its genome 
is the sea anemone Nematostella vectensis, 
a cnidarian. 27 Somewhat surprisingly, no 
mir-100 ortholog was detectable in any of 
the other diploblast genomes (cnidarians, cteno- 
phores, and poriferans), although this 
might be explained by the incomplete 
status of these genome projects. 

The let-7 microRNAs were detected 
by northern blot in a wide variety of 
both deuterostomes and protostomes, but 
remained undetectable in diploblasts. 11 
Consistent results were obtained by 
microRNA sequencing. 28 Chaetognatha, 
which sometimes have been hypothesized 
as pre-dating the protostome-deuterostome 
divergence, 29 also have a let-7 homolog. 11 
More recent phylogenetic studies, however, 
place them firmly within Protostomia. 30,31 
No trace of a let-7 or mir-125 
homolog can be found in any of the non- 
bilaterian animal genomes. 

Let-7 in Protostomes. In most major 
protostome clades, we find a single intact 
cluster of mir-100, let-7, and mir-125. 
Typically, the cluster is tightly linked 
indicating an intact polycistronic trans- 
cript. This is the case for both lopho- 
trochozoans and ecdysozoans with some 
exceptions. 

Among lophotrochozoans, complete 
and tightly linked clusters are found in 
the annelids Capitella teleta and Platynereis 
dumerilii. In the latter, the expression of 
the let-7 cluster has been studied in detail. 32 In 
the mollusc Lottia gigantea, the mir-125 
homolog is missing. In contrast, the cluster 
has disintegrated in platyhelminths and 



mir-100 appears to be missing completely. 
Schistosomes have a single copy of let-7 
and two mir-125 paralogs. 33-35 In Schmi- 
dtea mediterranea, multiple copies of let-7 
and mir-125 as well as a single copy of 
lin-4 have been annotated. 36-38 In this 
scope, lin-4 can be seen as a putative 
homolog of mir-125. Both microRNA 
families show perfect conservation of their 
seed sequences, i.e., either nucleotides 1-7 
or 2-8 of the 5p-miR region, compared 
with human let-7 and mir-125 miRs, 
respectively. Several substitutions are 
encountered in the remaining part of the 
5p-miR sequences, however. Hence, the 
assignment of platyhelminth let-7 and 
mir-125 paralogs to particular subfamilies 
remains inconclusive. 39 
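
The seed comparison used above (nucleotides 1-7 or 2-8 of the 5p-miR) can be made explicit with a small helper. The sequences below are toy examples chosen to show a conserved seed with substitutions further 3'; they are not taken from the data set, and the function names are hypothetical.

```python
def seed_windows(mir):
    """Return the two seed windows (nt 1-7 and nt 2-8) of a mature miR."""
    rna = mir.upper().replace("T", "U")
    return rna[0:7], rna[1:8]


def seed_conserved(mir_a, mir_b):
    """True if at least one of the two seed windows is identical."""
    windows_a = seed_windows(mir_a)
    windows_b = seed_windows(mir_b)
    return windows_a[0] == windows_b[0] or windows_a[1] == windows_b[1]


reference = "UGAGGUAGUAGGUUGUAUAGUU"   # toy let-7-like 5p-miR
derived = "UGAGGUAGUAGAUUGAAUAGCU"     # same seed, substitutions downstream
print(seed_conserved(reference, derived))
```

A conserved seed combined with a diverged remainder is exactly the situation in which subfamily assignment stays ambiguous.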

Much more genomic data are available 
for Ecdysozoa. In arthropods, mir-100, 
let-7, and mir-125 form a tight genomic 
cluster in which the microRNAs are 
separated only by a few hundred nucleo- 
tides. In Drosophila melanogaster, the poly- 
cistronic primary transcript and its expres- 
sion has been studied in detail. 40 In a few 
species, one of the cluster members is 
lacking, possibly due to missing data. In 
nematodes, an intact cluster is present 
only in Trichinella spiralis, i.e., in the most 
basal clade Dorylaimia. In contrast, most 
rhabditid worms including Caenorhabditis 
elegans have an isolated let-7 gene and 
lack annotated mir-125 and mir-100 
homologs. The loss of mir-100 appears 
to be a relatively recent phenomenon in 
Caenorhabditis and Pristionchus, 41 since 
a mir-100 linked to let-7 can be found 
in Heterorhabditis bacteriophora. Ruby 
et al. proposed, based on a match of the 
seed sequence, that the two related 
microRNA clusters mir-51/mir-53 (Chr. 
IV) and mir-54/mir-55/mir-56 (Chr.X) 
are co-orthologs of miR-100 in C. ele- 
gans. 42 The cluster on the Chr.X is 
separated from cel-let-7 by more than 
1.5 Mb. Beyond the seed nucleotides, 
no homology with mir-100 is detectable, 
however, so that their relation with mir- 
100 remains uncertain. 

Clusters comprising mir-100 and let-7 
are also found in Tylenchina (e.g., 
Meloidogyne, Heterodera) as well as 
Spirurina (Brugia malayi and Ascaris 
suum). Poole and colleagues reported four 
mir-100 paralogs in B. malayi, 43 only one 
of which, bma-mir-100b, is linked with 
the sole annotated let-7 and, furthermore, 
shows perfect sequence conservation with 
the human miR-100. The remaining three 
microRNAs show a conserved seed region 
but comprise various mutations in the 3' 
end of their 5p-miR sequence. None of 
these genomes contain a mir-125. 

Lin-4, one of the first microRNAs to 
be discovered, 44 is functionally closely 
associated with let-7. 45 It was recognized 
as a putative ortholog of mir-125 by 
Lagos-Quintana et al. 46 based on the 
similarities of the 5p-miR regions. We 
find that the sequence homology covers 
the majority of the precursor hairpin 
supporting the homology of lin-4 and 
mir-125. In contrast to mir-125, however, 
none of the annotated lin-4 sequences 
is linked with let-7 and/or mir-100. C. 
elegans lin-4 is located in intron 9 of 
the protein-coding gene F59G1.4. This 
arrangement is conserved in both Pristion- 
chus pacificus and B. malayi. No lin-4 
sequence is detectable in T. spiralis, 
which has an intact mir-100/let-7/mir- 
125 cluster. 

In C. elegans, there is an antisense 
transcript of lin-4, which could also give 
rise to a miRNA similar to the iab-4/iab-8 
pair in Drosophila. 47 Thus, we checked 
whether lin-4 might originate from a 
mir-125 antisense hairpin. Comparisons 
of a lin-4 CM against annotated mir-125 
sequences and a mir-125 CM against 
lin-4 sequences in both reading direc- 
tions show that mir-125 and lin-4 match 
significantly better in sense direction. We 
thus hypothesize that the sequence diver- 
gence of lin-4 is coupled with the breaking 
up of the ancestral cluster. 

Let-7 in Gnathostomes. In total, we 
collected 874 microRNAs among Deutero- 
stomia including 128 mir-100 sequences, 
135 mir-125 sequences, and 611 let-7 
sequences. Most of these sequences were 
found in Gnathostomata (jawed verte- 
brates). The miRBase lists 12 let-7 paralogs 
in human including three genomic loci at 
which let-7 appears to be accompanied 
by other microRNAs. The best-known of 
these clusters, A, is composed of mir-99b, 
let-7e, and mir-125a on chromosome 19. 
The two other loci are C: mir-99a, let-7c, 
mir-125b-2 (chr.21) and D: mir-100, let- 
7a-2, mir-125b-1 (chr.11). The association 
of mir-125 and let-7 at the latter two 
loci, although previously noticed, e.g., by 
Roush and Slack, 13 is not annotated as a 
cluster in miRBase, since the distances of 
50 and 46 kb, resp., are larger than the 
(arbitrary) 10 kb threshold. The second 
type of let-7 cluster consists of members 
of the let-7 family only. The paradigmatic 
example is the cluster E on chromosome 
9, consisting of let-7a-1, let-7f-1, and let- 
7d. The two remaining clustered loci are 
F on chr.X (let-7f-2 and mir-98) and G 
on chr.22 (let-7a-3 and let-7b). Two 
additional loci, designated here as I 
(chr.12) and J (chr.3), each harbour a 
single annotated human let-7 miRNA 
(let-7i and let-7g, resp.). Our homology 
searches revealed two additional sequences 
similar to let-7d, located at positions K 
(chr.17) and L (chr.1), i.e., unrelated to 
the previously described let-7 loci. 
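
The effect of a fixed distance threshold on cluster annotation can be illustrated with a short sketch. The coordinates below are hypothetical and only mimic the situation at locus C, where let-7 and mir-125 lie roughly 50 kb apart; the grouping rule (neighbouring genes on the same chromosome separated by at most the threshold) is an assumed simplification, not a description of the miRBase procedure.

```python
def cluster_loci(loci, max_gap=10_000):
    """Group microRNA genes whose neighbours lie within max_gap nucleotides.

    loci: list of (name, chromosome, position) tuples.
    Returns a list of clusters, each a list of gene names in genomic order.
    """
    clusters = []
    for name, chrom, pos in sorted(loci, key=lambda locus: (locus[1], locus[2])):
        last = clusters[-1][-1] if clusters else None
        if last is not None and last[1] == chrom and pos - last[2] <= max_gap:
            clusters[-1].append((name, chrom, pos))
        else:
            clusters.append([(name, chrom, pos)])
    return [[name for name, _, _ in cluster] for cluster in clusters]


# Hypothetical positions: let-7c close to mir-99a, mir-125b-2 about 50 kb away.
locus_c = [
    ("mir-99a", "chr21", 1_000),
    ("let-7c", "chr21", 1_800),
    ("mir-125b-2", "chr21", 51_800),
]
print(cluster_loci(locus_c))                  # 10 kb threshold splits the locus
print(cluster_loci(locus_c, max_gap=60_000))  # a larger threshold joins it
```

With the 10 kb default, mir-125b-2 is reported separately, which is why the mixed loci C and D are not annotated as clusters even though their members are linked.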

With the exception of the novel loci 
K and L (see below), this arrangement of 
let-7 paralogs is well conserved: The three 
mixed clusters A, C, and D, the three 
homogeneous clusters E, F, and G, and 
the two isolated loci I and J can be 
traced throughout all available tetrapods. 
A summary of these gnathostome let-7 
clusters is compiled in Table 1; the cor- 
responding gene phylogenies are shown 
in Figure 1. An extended table showing 
annotated miRBase names and distances 
between adjacent microRNAs is also 
available online. 

Orthology of the corresponding loci is 
unambiguously established based on 
both synteny information and sequence 
similarity (see Methods for details). In the 
chicken genome, two additional clustered 
let-7 paralogs, let-7k and let-7j, were 
reported. 48 They clearly form a fourth 
homogeneous cluster, H, absent in 
eutheria and metatheria. Evidence for the 
presence of the D, E, G, H, and J loci 
can also be found in the genome of the 
elephant shark. Since this genome is 
sequenced only at low coverage, 49 it is 
plausible that missing loci are due to lack 
of data rather than true losses. 

The genomes of nearly all vertebrates, 
more precisely of the gnathostomes to the 
exclusion of lampreys and hagfishes, share 
two rounds of genome duplications. 50-52 



Table 1. Overview of miRNA clusters among Gnathostomata ordered by their presumed evolutionary 
history. Previously annotated miRNAs (miRBase 18) are depicted by filled circles, newly found putative miRNAs 
are shown as empty circles. Dashed lines separate different lineages: Primates + Tupaia (1), Glires (2), 
Euarchontoglires (1+2), Laurasiatheria (3), Afrotheria (4), Xenarthra (5), Eutheria (1-5), Metatheria (6), 
Sauropsida (7), Teleostei (8). The two paralogous sets of clusters are separated by the long-dashed line. 

Figure 1. Estimated phylogenetic trees for the let-7 miRNA sequences. The two trees contain selected sequences from clusters A, C, and D (left) and E to J 
(right). The numbers at branches indicate the posterior support. Subtrees that are specific for teleosts (label prefix "teleost") or non-teleosts (no label 
prefix) were collapsed to increase readability of the right tree, even if the subtrees are not complete. 



Both the three mixed (A, C, and D) 
and the four homogeneous (E, F, G, and 
H) let-7 clusters clearly are the result 
of the vertebrate-specific (2R) genome 
duplications. 

The situation is more complex in the 
five teleosts due to an extra round of 
genome duplication. The fish-specific 
genome duplication (FSGD) preceded 
the divergence of the teleosts. 53,54 Com- 
bining synteny information and sequence 
comparison allows us to resolve the orthology 
relationships of the let-7 loci among the 
teleosts, see Table 2. 

The correspondence of tetrapod and 
teleost let-7 clusters cannot be determined 
based on sequence similarity alone due to 
the short sequences and the large phylo- 
genetic distances. For most loci, strong 
support comes again from synteny infor- 
mation. The teleost Aa locus, in particular, 
shares several flanking protein-coding 
genes with the human A locus, e.g., 
SMG9 and HAS1. We note, however, 
that the sequences of the teleost Aa and 
Ab loci are not recognizable as orthologs 
of the tetrapod A locus, while synteny 
and sequence data are largely consistent 
for the other loci. There is no support for 
the alternative explanation either, namely 
that the teleost clusters Aa and Ab derive 
really from an ancestral gnathostome B 
locus that has been lost completely in 
Tetrapoda, while the A locus has com- 
pletely disappeared in teleosts. Taken 



together, thus, the data suggest an ances- 
tral state prior to the FSGD that closely 
matches the ancestral state in gnathos- 
tomes, see Figure 2. The only changes that 
can be attributed to the actinopterygian 
stem lineages are the loss of A-mir-100 
and E-let-7-3. Surprisingly, the loci I and 
J are found in a loose association with 
the H and G clusters. In the case of G/I, 
this association is found in both paralogs, 
implying that the proximity of G and I 
loci preceded the FSGD. The G and I 
loci are also found on the same chromo- 
somes, although separated by many mega- 
bases, in rat, dog, cow, sheep, and in 
sauropsids. 

During gnathostome evolution, we 
observe several clade-specific loss events 
of entire clusters and individual micro- 
RNAs, cf. Figure 2. The most dramatic 
reductions occur in the wake of the FSGD 
with the complete loss of one copy of 
clusters E and J, the subsequent loss of 
the other copy of E in the percomorph 
lineage (pufferfishes, medaka, and stickle- 
back), and the deletion of the Ca cluster 
in pufferfishes. Aves lack both the A and 
the F clusters, both of which are still 
present in the lizard genome. This could, 
however, be due to the bird genome assemblies, 
which are known to be incomplete, 
in particular in their coverage of the 
micro-chromosomes. 55 Among mammals, 
only platypus features an H cluster, while 
this locus is lost in all Theria. Other 



missing clusters in Eutheria affect in parti- 
cular low coverage genomes and might be 
explained better by an incomplete assem- 
bly. A conspicuous pattern is the lack of 
the A cluster in the lemurs, however. A 

Table 2. Correspondence of let-7 loci in the teleost 
genomes of Danio rerio (dre), Oryzia latipes (ola), 
Gasterosteus aculeatus (gac), Takifugu rubripes (tru), 
and Tetraodon nigrovirides (tni) 



loc. 


dre 


ola 


gac 


tru 


tni 


Aa 


16 27M 


16 ,2M 


XX™ 


s.22 


06M 


Ab 


1 glOM 


s.1995 




s.37 


21-rnd 


Ca 


1Q 39M 


-| 4 1M 


VII 1 ™ 






Cb 


1 ^29M 








- 


Da 


1 Cj20M 


1 3 2M 




s.1 44 


16™ 


Db 


C31M 




vir™ 


s.6 


76M 


Ea 


1 1 28M 










Ha 


£54M 


C12M 


XVII 1 ™ 


s.7484 


11™ 


Ja 


£41M 


5 7M 


c.7697 


s.56 


11™ 


Hb 


23™ 




XII 1 ™ 


s.93 


g4.6M 


Fa 




y12M 


XII 11M 


s.66 


g3.6M 


Fb 




y26M 


XII 1 ™ 


s.2159 


g!OM 


Ga 


25™ 


6 6M 


XIX 4 ™ 


s.2 


1 g9.2M 


la 


251.6M 


g5.8M 


XIX 4 ™ 


s.2 




Gb 


^1 7M 


2^6M 


IV 1 ™ 


s.1 77 


■J g53M 


lb 




s.1 942 


|y20M 


s.1 77 


1 g5.4M 


Synteny 


was 


determined from the 


pairwise 



alignment nets 19 provided through the UCSC 



genome browser. Note that the genomic coordi- 
nates of each loci are abbreviated by the 
chromosome or scaffold number and their 
position on megabase scale in superscript. 
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Figure 2. Putative evolutionary history of the let-7 microRNA clusters across Bilateria. A white triangle represents a mir-100, a gray triangle represents 
mir-125 microRNAs while let-7 sequences are depicted by black triangles. Annotated lin-4 microRNAs are shown with circles. 1R/2R denote two rounds of 
whole genome duplications, whereas FSGD labels the additional teleost-specific genome duplication. The duplicate clusters in teleosts are highlighted 
by different shading. Entire lineages are written in bold, genera are written in italic. Dispersed, highly derived let-7 and mir-125 paralogs 
in platyhelminthes and in Brugia are not shown. 



gain of new let-7 loci is observed only in 
primates. The L locus appears in 
Haplorhini (tarsier and monkeys), while 
the K locus is present in Catarrhini (old 
world monkeys) only. We found evidence 
of expression of the miR sequence of K 
and L loci in miRNA-seq data of human 
and rhesus macaque brain samples 56 as 
well as in small RNA-seq data of the 
ENCODE cell lines. 57 

The ancestral let-7 clusters were appar- 
ently tightly linked as one would expect 
from a polycistronic primary transcript. 
However, some of these distances in mixed 
clusters substantially increase in tetrapods, 
namely, D-mir-100/D-let-7, C-let-7/C- 
mir-125, and D-let-7/D-mir-125, see also 
Figure 3. In both the C and D loci, the entire cluster 
is contained in the introns of a non-coding 
primary precursor, in human known as 
LINC00478 (C-cluster) and MIR100HG 
(D-cluster), respectively. Cluster F is 
expressed from an intron of the coding 
HUWE1 gene throughout and locus J is 
conserved within an intron of the coding 
WDR82 gene. Cluster G, in contrast, is 
exonic, located in the 3' exon of the non- 
coding host gene MIRLET7BHG. Loci 
E and I are associated with clusters of 
unspliced ESTs; the expression of cluster 
A cannot be resolved from currently 
available data. 

Interestingly, human MIRLET7BHG 
harbors in one of its introns the addi- 
tional annotated microRNA mir-3619 
for which evidence of expression was found 
in small RNA-seq data from embryonic 
stem cells. 58 This is an evolutionarily 



young innovation, present only in old 
world monkeys. 

By the use of family-wide covariance 
models of all let-7 families, clear evidence 
for the close relationships of let-7 micro- 
RNAs of mixed clusters can be found, cf. 
Figure 4. Furthermore, all let-7-1 and 
let-7-2 clusters appear to be closely related 
to each other corroborating their origin 
by genome duplications. 

Let-7 in Basal Deuterostomes. Cyclo- 
stomia, lampreys and hagfishes, share at 
least one and possibly both rounds of the 
vertebrate genome duplication. Genomic 
data are solely available for the lamprey 
Petromyzon marinus. The miRBase lists 7 
let-7, 3 mir-100, and a single mir-125, 59 
not all of which can be recovered from 
the available genome assembly. The 
pma-mir-100c sequence, however, is the 
reverse complement of pma-mir-100a. 

Figure 3. Distances between adjacent pairs of microRNAs (mir-100/let-7 and let-7/mir- 
125) in the mixed clusters A, C, and D are conserved across Mammalia. Locus A is quite compact 
throughout the gnathostomes. Cluster D, in contrast, shows consistently large distances in 
tetrapods. In the cat Felis catus, distance outliers are due to large gaps in the assembly within 
clusters. 

The mir-100a and mir-125 genes are 
genomically linked and hence derive 
from the ancestral mixed cluster, although 
the corresponding let-7 expected to lie 
between them is missing. The presence of 
pma-mir-100b (without a linked mir-125 
paralog) serves as a witness of at least one 
round of genome duplication. Based on 
the similarities measured with the help of 
covariance models, we can also identify 
pma-let-7a-4 as descendant of the let-7 
located in the ancestral mixed cluster. 
It seems to be the only lamprey let-7 
miRNA originating from a mixed cluster. 
However, the assignment of the corres- 
ponding cluster is not possible since the 
sequence is not mappable to any available 
genome assembly. Among the homo- 
geneous clusters, pma-let-7d and pma-let- 
7a-3 form a cluster; for the other loci the 
genome assembly is too fragmented to 
be informative. 

Figure 4. Heatmap illustrating the structural and sequence similarity of all let-7 families. To this end, all let-7 microRNA sequences of each family were 
scored against each family-wide covariance model. The average bitscore of covariance model (row) vs. let-7 family (column) is visualized in a color 
gradient. The standard deviations of these bitscores are always below 15, with a median bitscore standard deviation of 3.4. Due to the sequence 
divergence of cluster A in tetrapods and Aa/Ab in teleost fishes, both lineages were analyzed separately. 

Figure 5. Highly derived mir-100 paralogs in basal deuterostomes. Although the seed regions also contain substitutions, the homology is clearly visible. 

By applying Infernal to 
these miRNAs, strong evidence is obtained 
that P. marinus contains at least 2 homo- 
geneous clusters comprising both a let-7-1 
and a let-7-2 paralog. One of the two 
remaining sequences, pma-let-7c, shows 
also obvious characteristics of a let-7-1 
subfamily, whereas the other one, pma-let- 
7a-1, reveals no clear features to make a 
precise assignment of its origin. Sequenc- 
ing of short RNAs also shows the presence 
of, presumably, multiple copies of mir- 
100, mir-125, and let-7 in the genome of 
the hagfish Myxine glutinosa. 59 

In Ciona intestinalis, an intron of an 
EST cluster that is homologous to the 
HUWE1 protein harbours a closely spaced 
cluster of four copies of let-7. On a 
different chromosome, there is a single 
mixed cluster consisting of mir-1473, let- 
7d, and mir-125. A very similar arrange- 
ment is found in Ciona savignyi. Upon 
closer inspection, cin-mir-1473 and csa- 
mir-1473 are clearly homologs of mir-100 
revealing an ancestral mixed cluster, see 
Figure 5. In Oikopleura dioica, one locus 
harbours a let-7 and a mir-1473/mir-100 
ortholog, 4 a second locus consists of two 
let-7 paralogs, and a final copy of let-7 is 
found on a third scaffold. The cin-let-7a-1 
and cin-let-7a-2, as well as their counter- 
parts csa-let-7c-1 and csa-let-7c-2, appear 
to be homologs of the let-7-2 and let-7-1 
subfamily, respectively. In C. intestinalis, 
let-7a-1 is the first miRNA in the homo- 
geneous let-7 cluster while let-7a-2 is 
located at the end. On the other hand, 
both C. savignyi sequences are located at 
the beginning. In O. dioica, however, 
solely odi-let-7c located on scaffold 10 



appears to be assignable to the let-7-1 
subfamily whereas the corresponding 
member of the let-7-2 subfamily is 
missing. Both cin-let-7c and its ortholog 
csa-let-7a might be copies of the let-7 
miRNA of the mixed cluster. Unfortu- 
nately, the other orthologous miRNAs 
cin-let-7b and csa-let-7b cannot clearly be 
assigned to any let-7 subfamily by the use 
of covariance models, although their 5p- 
miR appears to be closely related to let-7 
miRNAs originating from mixed clusters. 
Nevertheless, both remaining O. dioica 
sequences, namely odi-let-7a and odi-let- 
7b, are neither unambiguously assignable 
to any ascidian let-7 miRNA nor to 
any other subfamily. The relationships of 
the let-7 loci in basal deuterostomes are 
summarized in Figure 6. 



In the lancelet Branchiostoma floridae, 
there is a second copy of let-7 located at 
the 3' end of the canonical cluster. 60 The 
two bfl-let-7 precursor sequences differ by 
only 4 point mutations, and hence are 
probably the results of a lineage-specific 
duplication. The ancestral cluster mir-100/ 
let-7/mir-125 is also present in ambula- 
crarians, i.e., the acorn worm (hemichor- 
data) and the sea urchin (echinodermata). 
In the latter, the mir-100 ortholog is 
rather diverged and recorded as spu-mir- 
2003 in miRBase. 

The most basal clade of Deuterostomia, 
recently termed Xenacoelomorpha, 61 is 
composed of Xenoturbellida and Acoelo- 
morpha. For Xenoturbella bocki, mature 
sequences of mir-125, let-7, and mir-100 
have been reported. 61 In contrast, no
evidence for any of the three microRNA
families was found in microRNA libraries
of Hofstenia miamia 61 and Symsagittifera
roscoffensis. 25,62 A survey of several acoels
by northern blot also returned a negative
result. 11

Figure 6. Relationships of let-7 loci in the basal deuterostomes P. marinus (pma), O. dioica (odi),
C. savignyi (csa), and C. intestinalis (cin). Assignments to certain let-7 subfamilies were made with
the use of Infernal. For details, see text.

Antisense microRNAs and cluster 
extensions. MicroRNA precursors are very
stable hairpin structures, so that their 3'
and 5' halves are close to being reverse
complements. An antisense transcript
will therefore in general also give rise to a
pre-mir-like hairpin structure. Indeed,
functional antisense microRNAs have
been reported in the literature for several
loci, see, e.g., Bender. 63 It does not come
as a surprise, therefore, that antisense 
microRNAs have been found in deep 
sequencing data for some of the let-7 
loci, cf. Table 3. In this survey, however, 
microRNAs that are located antisense 
to miRBase annotated mir-100, mir-125, 
or let-7 microRNAs were not further 
investigated. 
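
The reverse-complement argument above can be illustrated with a minimal Python sketch. It assumes an idealized stem-loop with perfectly paired arms of equal length and no bulges. The arm and loop sequences, as well as the helper names `reverse_complement` and `stem_pairs`, are hypothetical and chosen only for illustration; they are not real precursors and not part of the survey.

```python
# Toy illustration: an idealized hairpin remains a hairpin on the opposite strand.
# All sequences are made up for illustration; they are not real pre-miRNAs.

COMPLEMENT = str.maketrans("ACGU", "UGCA")
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string over A, C, G, U."""
    return rna.translate(COMPLEMENT)[::-1]


def stem_pairs(rna: str, loop_len: int) -> int:
    """Count Watson-Crick pairs when the 5' arm is folded onto the 3' arm.

    Assumes a simple stem-loop with arms of equal length and no bulges.
    """
    arm_len = (len(rna) - loop_len) // 2
    five_arm = rna[:arm_len]
    three_arm = rna[len(rna) - arm_len:]
    return sum((a, b) in WATSON_CRICK for a, b in zip(five_arm, reversed(three_arm)))


arm = "GGCAUAGCUUAGCAGUC"   # hypothetical 5' arm
loop = "UUAGGGUC"           # hypothetical terminal loop
sense = arm + loop + reverse_complement(arm)
antisense = reverse_complement(sense)

# Both strands pair along the full length of the arm.
print(stem_pairs(sense, len(loop)), stem_pairs(antisense, len(loop)))
```

In this idealized case the antisense strand forms a stem of the same length as the sense strand. Real precursors contain mismatches and G-U wobble pairs, and a G-U pair becomes a non-pairing C-A on the opposite strand, so antisense hairpins are in general expected to be somewhat less stable than this sketch suggests.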

Hairpins of about the size of pre- 
microRNAs are among the most frequent 
secondary structure motifs. This provides 
a likely mechanism for the innovation of 
new microRNAs. 6 The current version of 
miRBase reports several cases of addi- 
tional microRNAs that emerged within 
or closely adjacent to a let-7 cluster. The 
best conserved example is hsa-mir-4763 64 
located in the G cluster between hsa-let-7a-3
and hsa-let-7b. The human miR
sequence is fairly well conserved in
primates and to some extent in eutheria,
see the corresponding alignment in the
supplementary material. However, there was
no clear evidence of expression of the 
annotated hsa-mir-4763 in brain samples 56 
or ENCODE cell lines. 57 Furthermore, 
conserved orthologs in Macaca mulatta
and Canis familiaris revealed no expression
in miRNA-seq data from rhesus macaque 
brain 56 or in small RNA-seq data from 
domestic dog lymphocytes, 65 respectively. 
The cow microRNA bta-mir-2443, which
is also located in cluster G between the two
let-7 microRNAs, is not an ortholog of
the hsa-mir-4763 sequence. There is
evidence for its expression in small RNA
libraries of bovine kidney cells. 66
Interestingly, its only detectable
ortholog is found in the dolphin Tursiops
truncatus. This microRNA is located upstream
of the mir-4763 ortholog and of
the G-let-7-2 sequence, while the corresponding
let-7-1 sequence of cluster G is
missing. In Bombyx mori, bmo-mir-2795
is inserted between let-7 and mir-100. 67 A 
search of the NCBI databases did not 
reveal homologs in other insects. In the 
zebra finch, finally, tgu-mir-2987 is found 
about 3.1 kb downstream of the E cluster. 
Its sequence is not conserved in other bird 
genomes. 


Discussion 

The association of mir-100, mir-125, and 
let-7 with its key conserved function in 
developmental timing 32 is one of the 
evolutionarily most ancient systems of 
microRNA-based regulation. The ancestral 
cluster of these three microRNAs dates 
back to the advent of Bilateria. In fact, 
only mir-100 and mir-10 date back 
further and are common to Eubilateria. 3 " 5 
The evolution of the let-7 cluster in 
Protostomia is characterized mostly by 
partial losses and only occasional gene 
duplications (e.g., mir-100 in Brugia and
mir-125 in platyhelminths). In contrast,
early chordates have acquired a second 
let-7 locus that subsequently expanded by 



tandem duplication. The vertebrate- 
specific genome duplications expanded 
this system to a large number of para- 
logous loci. The retention rate of these 
paralogs is rather high, with up to 20 let-7 
cluster microRNAs present in extant 
tetrapods, compared with the ancestral 
24 microRNAs that are inferred from 
two rounds of duplications of the two 
chordate clusters. This is comparable with
the fate of important transcriptional
regulators such as the HOX gene clusters, 68,69
whereas the redundancy generated
by genome duplications has been almost completely
resolved for, e.g., metabolic enzymes.
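
The retention estimate in this paragraph can be made explicit with a short calculation. The following Python sketch uses only the numbers stated in the text (two rounds of duplication, 24 inferred ancestral microRNAs, up to 20 retained). The pre-duplication count it derives is implied by these numbers and is not stated separately; the variable names are illustrative.

```python
# Numbers taken from the text: two rounds of genome duplication, 24 inferred
# ancestral cluster microRNAs, and up to 20 retained in extant tetrapods.
rounds = 2
inferred_after_duplications = 24
retained_in_tetrapods = 20

# Each round doubles the gene content, so two rounds give four copies per gene.
copies_per_gene = 2 ** rounds

# Implied number of microRNAs in the two chordate clusters before duplication.
pre_duplication = inferred_after_duplications // copies_per_gene

# Upper bound on the fraction of paralogs retained.
retention = retained_in_tetrapods / inferred_after_duplications

print(pre_duplication, f"{retention:.0%}")
```

Under these assumptions, the two chordate clusters together held six microRNAs before the genome duplications, and up to five out of six inferred paralogs are still present in extant tetrapods.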

The detailed analysis of the let-7 family
also shows that microRNAs are not always as
conserved as one might expect. Beyond
loss events, we also found highly derived
paralogs that, by the combination of synteny
and sequence similarity, are unambiguously
recognizable as homologs. The best
examples are the homology of lin-4 and 
mir-125 in nematodes and the mir-100 
paralogs mir-1473 (tunicates) and mir- 
2003 (echinoderms). This observation 
suggests that undocumented homologies
are also present among other annotated
microRNA families. It has an impact
on the use of microRNAs as phylogenetic
markers, because an unrecognized derived
microRNA family can be misinterpreted
as the ancestral state in which the
microRNA family has not yet emerged.

The naming convention of miRBase 
for paralogous microRNAs has turned out 
to be a major technical inconvenience 
for the present study. True orthologs (as 
determined by both synteny and sequence 
comparison of the complete precursor 
sequences) not infrequently have different 
names in different species. Even worse, 
paralogous copies may have the same 
name. It would be desirable, therefore, to 
rethink the naming schemes to convey 
information on the genomic location. For 
vault RNAs, which also form multiple
clusters in mammalian genomes, gene
names that make the cluster membership
explicit were recently adopted by the
HGNC. 70 

Table 3. Antisense microRNAs associated with let-7 loci listed in miRBase v.18



Species    loc.   sense            antisense       Ref.
…          …      …                …               68
…          A      …                rno-mir-3596c   68
…          …      …                …               68
…          …      …                rno-mir-3596b   68
…          C      rno-mir-125b-2   rno-mir-3588    68
…          G-2    bta-let-7b       bta-mir-3596    69
lancelet          bfl-let-7a-1     …               70
…          …      bfl-mir-125a     …               …